skip to main content


Search for: All records

Creators/Authors contains: "Guo, Pei"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Cloud computing has become a major approach to help reproduce computational experiments. Yet there are still two main difficulties in reproducing batch based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to be reproduced in another cloud, a.k.a. vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We did extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show our toolkit can achieve good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics. 
    more » « less
  2. Network interpretation as an effort to reveal the features learned by a network remains largely visualization-based. In this paper, our goal is to tackle semantic network interpretation at both filter and decision level. For filter-level interpretation, we represent the concepts a filter encodes with a probability distribution of visual attributes. The decision-level interpretation is achieved by textual summarization that generates an explanatory sentence containing clues behind a network’s decision. A Bayesian inference algorithm is proposed to automatically associate filters and network decisions with visual attributes. Human study confirms that the semantic interpretation is a beneficial alternative or complement to visualization methods. We demonstrate the crucial role that semantic network interpretation can play in understanding a network’s failure patterns. More importantly, semantic network interpretation enables a better understanding of the correlation between a model’s performance and its distribution metrics like filter selectivity and concept sparseness. 
    more » « less
  3. null (Ed.)
    The Arctic sea ice has retreated rapidly in the past few decades, which is believed to be driven by various dynamic and thermodynamic processes in the atmosphere. The newly open water resulted from sea ice decline in turn exerts large influence on the atmosphere. Therefore, this study aims to investigate the causality between multiple atmospheric processes and sea ice variations using three distinct data-driven causality approaches that have been proposed recently: Temporal Causality Discovery Framework Non-combinatorial Optimization via Trace Exponential and Augmented lagrangian for Structure learning (NOTEARS) and Directed Acyclic Graph-Graph Neural Networks (DAG-GNN). We apply these three algorithms to 39 years of historical time-series data sets, which include 11 atmospheric variables from ERA-5 reanalysis product and passive microwave satellite retrieved sea ice extent. By comparing the causality graph results of these approaches with what we summarized from the literature, it shows that the static graphs produced by NOTEARS and DAG-GNN are relatively reasonable. The results from NOTEARS indicate that relative humidity and precipitation dominate sea ice changes among all variables, while the results from DAG-GNN suggest that the horizontal and meridional wind are more important for driving sea ice variations. However, both approaches produce some unrealistic cause-effect relationships. Additionally, these three methods cannot well detect the delayed impact of one variable on another in the Arctic. It also turns out that the results are rather sensitive to the choice of hyperparameters of the three methods. As a pioneer study, this work paves the way to disentangle the complex causal relationships in the Earth system, by taking the advantage of cutting-edge Artificial Intelligence technologies. 
    more » « less
  4. Data-driven causality discovery is a common way to understand causal relationships among different components of a system. We study how to achieve scalable data-driven causal- ity discovery on Amazon Web Services (AWS) and Microsoft Azure cloud and propose a causality discovery as a service (CDaaS) framework. With this framework, users can easily re- run previous causality discovery experiments or run causality discovery with different setups (such as new datasets or causality discovery parameters). Our CDaaS leverages Cloud Container Registry service and Virtual Machine service to achieve scal- able causality discovery with different discovery algorithms. We further did extensive experiments and benchmarking of our CDaaS to understand the effects of seven factors (big data engine parameter setting, virtual machine instance number, type, subtype, size, cloud service, cloud provider) and how to best provision cloud resources for our causality discovery service based on certain goals including execution time, budgetary cost and cost-performance ratio. We report our findings from the benchmarking, which can help obtain optimal configurations based on each application’s characteristics. The findings show proper configurations could lead to both faster execution time and less budgetary cost. 
    more » « less
  5. null (Ed.)
  6. null (Ed.)
  7. Granger causality and its learning algorithms have been widely used in many disciplines to study cause-effect relationship among time series variables. In this paper, we address computing challenges of state-of-art Granger causality learning algorithms, specially when facing increasing dimensionality of available datasets. We study how to leverage gradient boosting meta machine learning techniques to achieve accurate causality discovery and big data parallel techniques for efficient causality discovery from large temporal datasets. We propose two main algorithms for gradient boosting based causality learning, and parallel gradient boosting based causality learning. Our experiments show our proposed algorithms can achieve efficient learning in distributed environments with good learning accuracy. 
    more » « less